R code for this article

(Previously: Parts 1 and 2 of this series)

One of the easiest statistical mistakes to make is to assume that a sample is representative of its parent population, when in fact the way it was collected has led to a selection bias. Perhaps the most famous example of this is survivorship bias, as illustrated in the story of Abraham Wald and the Missing Bullet Holes. In this article I will discuss a paradox that arises from another type of selection bias: collider bias.

Imagine that we have a virus, ‘colliditis’, and that the severity of disease experienced by people who catch colliditis is uncorrelated with the number of cigarettes they smoke in a day. For the sake of argument, we assume that the average number of cigarettes smoked per day and the severity of disease are independently and uniformly distributed in the population of people who have caught colliditis. So if we plotted a random sample of one thousand people from this population, it would look like this.

n <- 1000 
infected_sample <- matrix(0, nrow=n, ncol=2) # we initialise an empty matrix to store the data for each person
for (i in 1:n) {
  severity <- runif(1, min=0, max=20) # we assign each person in the sample a random severity score between 0 and 20
  cigarettes <- runif(1,0,20) # we assign each person in the sample a random average number of cigarettes smoked per day between 0 and 20
  infected_sample[i,1] <- severity
  infected_sample[i,2] <- cigarettes
}
plot(infected_sample[,1], infected_sample[,2], pch=4, cex=1.2, lwd=2.4, col="#808080",
     xlab="Colliditis Severity Score", ylab="Average Number of Cigarettes Smoked in a Day")
abline(lm(infected_sample[,2] ~ infected_sample[,1]), lwd=4.8, col="#009ed8") # we plot the line of best fit (cigarettes regressed on severity, to match the plot axes)
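
The same sample can also be generated without a loop, and the claim that the two variables are uncorrelated can be checked numerically rather than by eye. This is a minimal sketch: the fixed seed and the names `severity_v`, `cigarettes_v` and `sample_cor` are assumptions added here for illustration, not part of the original code.

```r
set.seed(42) # assumption: a fixed seed, only so that the result is reproducible
n <- 1000

# runif() is vectorised, so all n people can be drawn in one call each
severity_v   <- runif(n, min = 0, max = 20)
cigarettes_v <- runif(n, min = 0, max = 20)

# Pearson correlation between the two variables in the full infected sample
sample_cor <- cor(severity_v, cigarettes_v)
print(sample_cor)
```

Because the two variables are drawn independently, the printed correlation should be close to zero, although it will rarely be exactly zero in a finite sample.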

Now suppose that, out of this sample, people with a colliditis severity score ≥15 have a 90% chance of being in hospital. Of those with a colliditis severity score <15, those who smoke ≥10 cigarettes a day on average have a 60% chance of being in hospital, compared to a 10% chance for those who smoke <10 cigarettes a day on average. If we plot cigarettes against colliditis severity for only those in our sample who are hospitalised, we see an interesting result.

# First we initialise a new matrix to store the data for each person in our sample who is hospitalised –
# those people who are not hospitalised will have NA NA in place of data for their rows of the matrix,
# and those who are in hospital will have a matching row to our infected_sample matrix:
hospital_infected_sample <- matrix(rep(NA, 2*n), nrow=n, ncol=2) 
for (i in 1:n) {
  dice_roll <- runif(1, min=0, max=1) # we generate a random number between 0 and 1
  person_i <- infected_sample[i,]
  # we work out this person's chance of being in hospital:
  if (person_i[1] >= 15) {
    p_hospital <- 0.9 # people with a severity score >=15 have a 90% chance
  } else if (person_i[2] >= 10) {
    p_hospital <- 0.6 # milder cases who smoke >=10 a day have a 60% chance
  } else {
    p_hospital <- 0.1 # others have a 10% chance
  }
  if (dice_roll <= p_hospital) {
    hospital_infected_sample[i,] <- person_i
  }
}
# Now we remove the people not in hospital from our sample:
hospital_infected_sample <- na.omit(hospital_infected_sample)
# And plot the result:
plot(hospital_infected_sample[,1], hospital_infected_sample[,2], pch=4, cex=1.2, lwd=2.4, col="#808080",
     xlab="Colliditis Severity Score", ylab="Average Number of Cigarettes Smoked in a Day")
abline(lm(hospital_infected_sample[,2] ~ hospital_infected_sample[,1]), lwd=4.8, col="#009ed8") # we plot the line of best fit (cigarettes regressed on severity, to match the plot axes)
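
To put a number on what the plot shows, we can compare the correlation in the full infected sample with the correlation among hospitalised patients only. The sketch below is a vectorised restatement of the hospitalisation rules described above, not the original code. The seed, the larger sample size (chosen only to make the estimates more stable) and the names `p_hospital`, `in_hospital`, `cor_all` and `cor_hospital` are assumptions made for illustration.

```r
set.seed(1)  # assumption: fixed seed for reproducibility
n <- 10000   # assumption: larger than the article's 1000, for a steadier estimate

severity   <- runif(n, min = 0, max = 20)
cigarettes <- runif(n, min = 0, max = 20)

# Chance of being in hospital, following the three rules in the text
p_hospital <- ifelse(severity >= 15, 0.9,
                     ifelse(cigarettes >= 10, 0.6, 0.1))

# One "dice roll" per person decides whether they are hospitalised
in_hospital <- runif(n) <= p_hospital

cor_all      <- cor(severity, cigarettes)                            # whole infected sample
cor_hospital <- cor(severity[in_hospital], cigarettes[in_hospital])  # hospitalised only

print(c(all = cor_all, hospital = cor_hospital))
```

The first value should be close to zero and the second clearly negative, even though nothing in the simulation links smoking to severity.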

If we took the hospitalised patients as representative of everyone who has caught colliditis, we would conclude that there is a negative association between the average number of cigarettes smoked in a day and the severity of colliditis. But we would be failing to account for the fact that having a severe case of colliditis and smoking a large number of cigarettes per day are both factors that make you more likely to be in hospital. So if you are in hospital with a low colliditis score, you are more likely to smoke a lot of cigarettes than if you are in hospital with a high colliditis score, since there must be some reason for your hospitalisation other than colliditis, and this reason may be smoking. We can represent the relationship between these factors using a probabilistic graphical model, in which severity and smoking each point to hospitalisation but are not connected to each other.
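
The claim that a hospitalised patient with a low severity score is more likely to be a heavy smoker can be checked directly with Bayes' rule, using only the probabilities from the scenario above. This is a worked sketch; the variable names are made up for illustration, and "heavy smoker" here simply means ≥10 cigarettes a day.

```r
# P(heavy smoker) in the infected population: cigarettes ~ Uniform(0, 20),
# so half of the people smoke >= 10 a day
p_heavy <- 0.5

# Low severity (< 15): hospitalisation chance depends on smoking
p_hosp_given_heavy_low <- 0.6
p_hosp_given_light_low <- 0.1
p_heavy_given_hosp_low <- (p_hosp_given_heavy_low * p_heavy) /
  (p_hosp_given_heavy_low * p_heavy + p_hosp_given_light_low * (1 - p_heavy))

# High severity (>= 15): 90% chance of hospitalisation regardless of smoking
p_hosp_high <- 0.9
p_heavy_given_hosp_high <- (p_hosp_high * p_heavy) /
  (p_hosp_high * p_heavy + p_hosp_high * (1 - p_heavy))

print(p_heavy_given_hosp_low)  # 6/7, roughly 0.857
print(p_heavy_given_hosp_high) # 0.5
```

So among hospitalised patients, about 86% of those with a low severity score are heavy smokers, compared with 50% of those with a high severity score, which is exactly the pattern that produces the downward-sloping line.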

We can’t draw any conclusions about the effect of the average number of cigarettes someone smokes in a day on the severity of their disease from a group of hospitalised patients with colliditis, because the probability of their being included in the sample depends on both of these variables. This is an illustration of Berkson’s paradox, the phenomenon whereby two factors appear to be correlated within a sample only because they both influence selection into it. In this article, Sanjay Srivastava gives a fun example of Berkson’s paradox: he recalls believing that burger quality and fry quality were negatively correlated in his town’s burger restaurants, not accounting for the fact that he never visited the restaurants where both burgers and fries were poor quality. The bias arising from multiple characteristics influencing the chance of selection into a sample is known as collider bias [1], and we say that these characteristics ‘collide’ on selection [2].
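
The burger example can be mimicked with a toy simulation. Everything numerical here is an assumption made for illustration and is not taken from Srivastava's article: the 0 to 10 quality scale, the cut-off of 5 for "poor", the number of restaurants and the seed.

```r
set.seed(7)           # assumption: fixed seed for reproducibility
n_restaurants <- 5000 # assumption: an arbitrary number of hypothetical restaurants

# Burger and fry quality are drawn independently, so they are unrelated by construction
burger <- runif(n_restaurants, min = 0, max = 10)
fries  <- runif(n_restaurants, min = 0, max = 10)

# The diner never visits places where both burgers and fries are poor (< 5)
visited <- !(burger < 5 & fries < 5)

cor_all_restaurants <- cor(burger, fries)                   # every restaurant
cor_visited         <- cor(burger[visited], fries[visited]) # only those visited

print(c(all = cor_all_restaurants, visited = cor_visited))
```

Among all restaurants the correlation should be close to zero, while among the visited ones it should be clearly negative: dropping the "both poor" corner is enough to create the apparent trade-off.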



  1. Hopin Lee, Jeffrey Aronson, and David Nunan, “Collider bias,” Catalogue of Bias, 2019.

  2. Annie Herbert et al., “The spectre of Berkson’s paradox: Collider bias in Covid-19 research,” Significance 17, no. 4 (August 2020): 6–7.